This report describes an analysis of the pollution in Madrid between 2011 and 2016.
The dataset consists in:
The stations are located all across the city:
This analysis requires these R packages:
These packages are installed and loaded if necessary by the main script.
The pollution and weather data are first read from the input files, formatted, combined and aggregated, into the data frame pollution_daily_h which provides the averaged information per day.
The dataset contains information for 12 pollutants: NO2, SO2, O3, PM2.5, BEN, CO, EBE, NMHC, NO, PM10, TCH, TOL.
The workflow to prepare the data is as below
Additional variables have been added:
The data frame pollution_daily_h is structured as below:
str(pollution_daily_h)
## 'data.frame': 2192 obs. of 22 variables:
## $ date : Date, format: "2011-01-01" "2011-01-02" ...
## $ NO2 : num 41.5 48.5 63.6 46.3 51.5 ...
## $ SO2 : num 10.71 11.93 11.91 8.84 9.51 ...
## $ O3 : num 20.47 15.56 9.45 13.34 10.88 ...
## $ PM2.5 : num 9.36 9.08 11.94 9.4 10.51 ...
## $ BEN : num 0.765 0.869 1.176 0.857 0.856 ...
## $ CO : num 0.372 0.453 0.532 0.367 0.395 ...
## $ EBE : num 0.595 0.722 1.047 0.788 1.074 ...
## $ NMHC : num 0.187 0.208 0.246 0.196 0.196 ...
## $ NO : num 16.5 28.8 71.1 27.8 36 ...
## $ PM10 : num 13.8 13.7 21.6 14.4 15.6 ...
## $ TCH : num 1.43 1.62 1.62 1.46 1.52 ...
## $ TOL : num 1.66 2.14 3.56 2.4 3.19 ...
## $ month : Date, format: "2011-01-01" "2011-01-01" ...
## $ week : Date, format: "2010-12-27" "2010-12-27" ...
## $ temp_avg : num 8.3 8.6 4.2 6.5 8.9 12.2 10.9 9.8 8.4 6.9 ...
## $ temp_max : num 13 13 9.4 8 10 15 13 13 11 8.3 ...
## $ temp_min : num 3 4 -1.6 4.1 6.3 8.9 8.7 7.6 6.6 5.2 ...
## $ precipitation : num 0 0 0 0 0 ...
## $ humidity : num 84 81 86 93 90 87 81 88 92 91 ...
## $ wind_avg_speed: num 5.2 5.4 3.5 6.3 10.4 15.7 15.6 14.3 6.5 7.4 ...
## $ temp_gap : num 10 9 11 3.9 3.7 6.1 4.3 5.4 4.4 3.1 ...
summary(pollution_daily_h)
## date NO2 SO2 O3
## Min. :2011-01-01 Min. : 7.819 Min. : 1.679 Min. : 3.979
## 1st Qu.:2012-07-01 1st Qu.: 26.170 1st Qu.: 3.757 1st Qu.: 30.088
## Median :2013-12-31 Median : 35.646 Median : 5.483 Median : 49.555
## Mean :2013-12-31 Mean : 38.828 Mean : 5.889 Mean : 47.866
## 3rd Qu.:2015-07-02 3rd Qu.: 48.122 3rd Qu.: 7.388 3rd Qu.: 65.058
## Max. :2016-12-31 Max. :105.077 Max. :17.129 Max. :110.287
## PM2.5 BEN CO EBE
## Min. : 2.625 Min. :0.1514 Min. :0.1496 Min. :0.1028
## 1st Qu.: 7.306 1st Qu.:0.4520 1st Qu.:0.2575 1st Qu.:0.3318
## Median :10.024 Median :0.6152 Median :0.3075 Median :0.6782
## Mean :11.112 Mean :0.7392 Mean :0.3574 Mean :0.6789
## 3rd Qu.:13.646 3rd Qu.:0.8911 3rd Qu.:0.4058 3rd Qu.:0.8651
## Max. :58.556 Max. :2.8132 Max. :1.1792 Max. :3.2252
## NMHC NO PM10 TCH
## Min. :0.04319 Min. : 2.351 Min. : 3.618 Min. :1.063
## 1st Qu.:0.15269 1st Qu.: 6.973 1st Qu.: 12.730 1st Qu.:1.327
## Median :0.18694 Median : 11.814 Median : 18.556 Median :1.394
## Mean :0.20500 Mean : 23.681 Mean : 20.805 Mean :1.421
## 3rd Qu.:0.23620 3rd Qu.: 26.178 3rd Qu.: 25.560 3rd Qu.:1.482
## Max. :0.66167 Max. :214.569 Max. :216.628 Max. :2.412
## TOL month week
## Min. : 0.2667 Min. :2011-01-01 Min. :2010-12-27
## 1st Qu.: 1.6969 1st Qu.:2012-07-01 1st Qu.:2012-06-30
## Median : 2.4893 Median :2013-12-16 Median :2013-12-30
## Mean : 3.0021 Mean :2013-12-16 Mean :2013-12-28
## 3rd Qu.: 3.6722 3rd Qu.:2015-07-01 3rd Qu.:2015-06-29
## Max. :15.6062 Max. :2016-12-01 Max. :2016-12-26
## temp_avg temp_max temp_min precipitation
## Min. :-0.50 Min. : 3.30 Min. :-7.000 Min. : 0.0000
## 1st Qu.: 8.40 1st Qu.:14.00 1st Qu.: 3.000 1st Qu.: 0.0000
## Median :14.60 Median :21.10 Median : 8.900 Median : 0.0000
## Mean :15.45 Mean :22.04 Mean : 8.755 Mean : 0.9173
## 3rd Qu.:22.70 3rd Qu.:30.10 3rd Qu.:14.600 3rd Qu.: 0.0000
## Max. :32.30 Max. :42.00 Max. :25.800 Max. :48.0100
## humidity wind_avg_speed temp_gap
## Min. :15.00 Min. : 2.00 Min. : 2.10
## 1st Qu.:36.00 1st Qu.: 6.30 1st Qu.: 9.50
## Median :55.00 Median : 8.70 Median :14.10
## Mean :55.04 Mean :10.15 Mean :13.29
## 3rd Qu.:73.00 3rd Qu.:12.60 3rd Qu.:17.00
## Max. :99.00 Max. :35.60 Max. :24.50
The data frame doesn’t contain any NA across its 2192 observations and 22 variables.
The charts below describe the evolution of each variable over time.
Dark Orange = Main Pollutants | Light Orange = Other Pollutants | Blue = Weather Parameters
We can see a cyclic evolution following the seasons, which suggests that the weather has an influence on the level of some pollutants.
To identify correlations between the variables, we firstly plot a correlation matrix:
Another view provides more information:
First we split the data in train and test (80% - 20%).
set.seed(2018)
train.size <- 0.8
train.index <- sample.int(length(pollution_daily_h$NO2), round(length(pollution_daily_h$NO2) * train.size))
train.sample <- pollution_daily_h[train.index,]
test.sample <- pollution_daily_h[-train.index,]
The Train Sample has 1754 rows and the Test Sample has 438 rows.
We want to define a multilinear regression model in order to explain NO2 with the rest of the variables. By definition, temp_min and temp_max are correlated with temp_avg, so we remove them.We use the variable temp_gap to measure their influence on the model.
multi_model_NO2<-lm(NO2~.-month-week-date-temp_min-temp_max, data=train.sample)
lm_stats <- summary(multi_model_NO2)
print(lm_stats)
##
## Call:
## lm(formula = NO2 ~ . - month - week - date - temp_min - temp_max,
## data = train.sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9368 -3.1235 0.0047 3.2157 21.3306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.229319 3.113391 0.395 0.693003
## SO2 1.383111 0.077489 17.849 < 2e-16 ***
## O3 -0.149220 0.012032 -12.402 < 2e-16 ***
## PM2.5 0.255679 0.067646 3.780 0.000162 ***
## BEN 3.343452 1.165812 2.868 0.004182 **
## CO 45.100966 3.957109 11.397 < 2e-16 ***
## EBE 2.843966 0.553288 5.140 3.06e-07 ***
## NMHC 1.358368 1.981506 0.686 0.493105
## NO -0.201584 0.017344 -11.623 < 2e-16 ***
## PM10 -0.008354 0.024343 -0.343 0.731516
## TCH 11.070151 1.621469 6.827 1.19e-11 ***
## TOL 2.850827 0.214695 13.279 < 2e-16 ***
## temp_avg -0.122812 0.038525 -3.188 0.001459 **
## precipitation 0.178564 0.041030 4.352 1.43e-05 ***
## humidity -0.098214 0.015546 -6.318 3.36e-10 ***
## wind_avg_speed -0.323058 0.032912 -9.816 < 2e-16 ***
## temp_gap 0.304426 0.053019 5.742 1.10e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.096 on 1737 degrees of freedom
## Multiple R-squared: 0.9113, Adjusted R-squared: 0.9105
## F-statistic: 1116 on 16 and 1737 DF, p-value: < 2.2e-16
Mean squared error is exactly how it sounds: we take the mean of all of our errors squared. This is a good measure for seeing how accurate a model is because we obviously want as little error as possible. In our case the MSE is 5.0961601.
Another thing to look at is the confidence intervals for our coefficients. Our estimates for each coefficient are not exact so we want to find a range where we are at a certain percent confident that the actual value is in this range . We can interpret this like: for every change of one (1) unit in the SO2, we are 95% confident that the NO2 will change between 1.2311291, 1.5350938.
Another important thing is to investigate is if the assumptions regarding linear regression are valid. This can be observed by creating the below 4 plots.
resids_multi_NO2 <- multi_model_NO2$residuals
After this analysis, lets see if making some transformations may be beneficial to our model. First, we implement stepwise regression to find out the significance in variables.
## Start: AIC=5729.65
## NO2 ~ (date + SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO +
## PM10 + TCH + TOL + month + week + temp_avg + temp_max + temp_min +
## precipitation + humidity + wind_avg_speed + temp_gap) - month -
## week - date - temp_min - temp_max
##
## Df Sum of Sq RSS AIC
## - PM10 1 3.1 45114 5727.8
## - NMHC 1 12.2 45124 5728.1
## <none> 45111 5729.7
## - BEN 1 213.6 45325 5735.9
## - temp_avg 1 263.9 45375 5737.9
## - PM2.5 1 371.0 45482 5742.0
## - precipitation 1 491.9 45603 5746.7
## - EBE 1 686.2 45798 5754.1
## - temp_gap 1 856.2 45968 5760.6
## - humidity 1 1036.6 46148 5767.5
## - TCH 1 1210.5 46322 5774.1
## - wind_avg_speed 1 2502.4 47614 5822.3
## - CO 1 3373.7 48485 5854.2
## - NO 1 3508.4 48620 5859.0
## - O3 1 3994.5 49106 5876.5
## - TOL 1 4579.1 49691 5897.2
## - SO2 1 8274.0 53385 6023.0
##
## Step: AIC=5727.77
## NO2 ~ SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO + TCH + TOL +
## temp_avg + precipitation + humidity + wind_avg_speed + temp_gap
##
## Df Sum of Sq RSS AIC
## - NMHC 1 11.8 45126 5726.2
## <none> 45114 5727.8
## + PM10 1 3.1 45111 5729.7
## - BEN 1 226.7 45341 5734.6
## - temp_avg 1 268.4 45383 5736.2
## - precipitation 1 489.2 45604 5744.7
## - EBE 1 684.2 45799 5752.2
## - temp_gap 1 860.1 45974 5758.9
## - PM2.5 1 1026.3 46141 5765.2
## - humidity 1 1067.1 46181 5766.8
## - TCH 1 1212.3 46327 5772.3
## - wind_avg_speed 1 2579.1 47693 5823.3
## - CO 1 3375.5 48490 5852.3
## - NO 1 3519.4 48634 5857.5
## - O3 1 4095.4 49210 5878.2
## - TOL 1 4582.1 49697 5895.4
## - SO2 1 8271.4 53386 6021.0
##
## Step: AIC=5726.23
## NO2 ~ SO2 + O3 + PM2.5 + BEN + CO + EBE + NO + TCH + TOL + temp_avg +
## precipitation + humidity + wind_avg_speed + temp_gap
##
## Df Sum of Sq RSS AIC
## <none> 45126 5726.2
## + NMHC 1 11.8 45114 5727.8
## + PM10 1 2.6 45124 5728.1
## - BEN 1 219.7 45346 5732.7
## - temp_avg 1 285.4 45412 5735.3
## - precipitation 1 493.7 45620 5743.3
## - EBE 1 674.1 45800 5750.2
## - temp_gap 1 863.8 45990 5757.5
## - PM2.5 1 1015.2 46141 5763.2
## - humidity 1 1089.1 46215 5766.1
## - TCH 1 2061.8 47188 5802.6
## - wind_avg_speed 1 2568.7 47695 5821.3
## - NO 1 3616.4 48743 5859.4
## - CO 1 3638.7 48765 5860.2
## - O3 1 4133.0 49259 5877.9
## - TOL 1 4701.8 49828 5898.1
## - SO2 1 8611.7 53738 6030.6
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## NO2 ~ (date + SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO +
## PM10 + TCH + TOL + month + week + temp_avg + temp_max + temp_min +
## precipitation + humidity + wind_avg_speed + temp_gap) - month -
## week - date - temp_min - temp_max
##
## Final Model:
## NO2 ~ SO2 + O3 + PM2.5 + BEN + CO + EBE + NO + TCH + TOL + temp_avg +
## precipitation + humidity + wind_avg_speed + temp_gap
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1737 45111.36 5729.651
## 2 - PM10 1 3.058383 1738 45114.42 5727.770
## 3 - NMHC 1 11.762300 1739 45126.18 5726.227
Based on the above mentioned results, we get the following formula and the variables removed should be PM10 and NMHC. So, we create the model.
We can still notice a really high R-squared of value 0.9112981. After that, we want to treat multicollinearity with the VIF Method (variance inflation factors).
As a general rule, if VIF is larger than 5, then multi collinearity is assumed to be high. So,by starting with all the variables in the model, we are going to * calculate the VIF values, * remove the biggest one, * re-do the model until all the explanatory variables have a VIF below 5.
## SO2 O3 PM2.5 BEN CO
## 2.731960 4.737638 2.589103 15.175970 21.433198
## EBE NO TCH TOL temp_avg
## 3.384605 15.882579 2.124477 11.084651 6.505867
## precipitation humidity wind_avg_speed temp_gap
## 1.218099 6.663257 1.991433 4.170651
For practicallity we decided to handle the above procedure with a WHILE loop.
selectedMod <- step_0
all_vifs <- car::vif(selectedMod)
print('VIF Values of our Initial Explanatory Variables:')
## [1] "VIF Values of our Initial Explanatory Variables:"
print(all_vifs)
## SO2 O3 PM2.5 BEN CO
## 2.731960 4.737638 2.589103 15.175970 21.433198
## EBE NO TCH TOL temp_avg
## 3.384605 15.882579 2.124477 11.084651 6.505867
## precipitation humidity wind_avg_speed temp_gap
## 1.218099 6.663257 1.991433 4.170651
signif_all <- names(all_vifs)
while(any(all_vifs > 5)){
var_with_max_vif <- names(which(all_vifs == max(all_vifs))) # get the variable with max vif
signif_all <- signif_all[!(signif_all) %in% var_with_max_vif] # re move this variable
myForm <- as.formula(paste("NO2~ ", paste (signif_all, collapse=" + "), sep="")) # design the new formula
selectedMod <- lm(myForm, data=train.sample) # re-build model with new formula
all_vifs <- car::vif(selectedMod)
}
print('VIF Values of our Final Explanatory Variables:')
## [1] "VIF Values of our Final Explanatory Variables:"
print(all_vifs)
## SO2 O3 PM2.5 EBE TCH
## 1.958788 3.351606 2.049109 1.829030 1.781684
## temp_avg precipitation wind_avg_speed temp_gap
## 3.056575 1.154820 1.743077 2.058636
So our Final Model after removing the multicollinear variables is
## NO2 ~ SO2 + O3 + PM2.5 + EBE + TCH + temp_avg + precipitation +
## wind_avg_speed + temp_gap
##
## Call:
## lm(formula = NO2 ~ SO2 + O3 + PM2.5 + EBE + TCH + temp_avg +
## precipitation + wind_avg_speed + temp_gap, data = train.sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9354 -4.1242 0.1759 3.9862 18.8808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.34165 2.32129 -0.578 0.5634
## SO2 1.70825 0.07619 22.421 < 2e-16 ***
## O3 -0.23205 0.01174 -19.774 < 2e-16 ***
## PM2.5 0.56677 0.03988 14.213 < 2e-16 ***
## EBE 8.34368 0.48211 17.307 < 2e-16 ***
## TCH 17.17983 1.44013 11.929 < 2e-16 ***
## temp_avg -0.13117 0.03131 -4.189 2.94e-05 ***
## precipitation 0.14279 0.04770 2.993 0.0028 **
## wind_avg_speed -0.35111 0.03630 -9.672 < 2e-16 ***
## temp_gap 0.77269 0.04452 17.355 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.095 on 1744 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.872
## F-statistic: 1328 on 9 and 1744 DF, p-value: < 2.2e-16
Let’s see what a 10-Fold Cross validation will tell us about our model:
## [1] "INITIAL MODEL:"
## Linear Regression
##
## 1754 samples
## 21 predictor
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1578, 1579, 1579, 1579, 1578, 1579, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.13777 0.9079525 3.97799
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "FINAL MODEL:"
## Linear Regression
##
## 1754 samples
## 9 predictor
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1578, 1578, 1579, 1579, 1578, 1578, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6.112716 0.8716267 4.866125
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Let’s have a look at the predictions for both models
test.sample$NO2_predicted_model_final <- predict(multi_model_NO2_final,test.sample)
test.sample$NO2_predicted_model_0 <- predict(multi_model_NO2_0,test.sample)
We show randomly rows 80-90 of the results. We can see that in some cases the original model is better, but also the other way around.
## NO2 NO2_predicted_model_0 NO2_predicted_model_final
## 428 32.90278 33.47543 38.52169
## 436 41.62434 45.47223 48.86567
## 438 66.43728 62.02274 58.57356
## 441 37.94618 36.13524 35.07174
## 443 20.01215 21.83412 21.85867
## 447 58.33043 54.81361 52.00605
## 464 35.96528 37.69067 40.09754
## 470 18.65799 15.89007 17.62860
## 482 28.10417 21.46581 22.55336
## 489 45.54340 37.47732 37.65892
## 490 31.50521 22.42451 23.59314
Let’s visualize it:
Blue = Initial Model | Orange = Final Model
Now lets compare the 2 models (initial and final):
## Analysis of Variance Table
##
## Model 1: NO2 ~ (date + SO2 + O3 + PM2.5 + BEN + CO + EBE + NMHC + NO +
## PM10 + TCH + TOL + month + week + temp_avg + temp_max + temp_min +
## precipitation + humidity + wind_avg_speed + temp_gap) - month -
## week - date - temp_min - temp_max - PM10 - NMHC
## Model 2: NO2 ~ SO2 + O3 + PM2.5 + EBE + TCH + temp_avg + precipitation +
## wind_avg_speed + temp_gap
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1739 45126
## 2 1744 64788 -5 -19661 151.54 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our Final Model provides a way to predict the NO2 pollution level based on 9 pollutants and 5 weather parameters.